Compiler Algorithms ---------------------------------------------------- NFA to DFA: Powerset construction. ---------------------------------------------------- DFA to Minimal DFA: Prune unreachable states and reduce to one trap state. Then: Naive: Pick pairs of states & check for L-equivalence on the pairs. Better: Partition states into L-equivalence classes & refine choices, then use Myhill-Nerode constructrion once done refining. ---------------------------------------------------- Regex to NFA: Use gadgets. ---------------------------------------------------- Removing dangling else: To match else's to innermost if's, note that in any if/else sequence, there are the matched if statements, and unmatched statements, and that the matched statements lie strictly inside the unmatched statements. That is, there is a distinguished "transition" in the sequence of nesting beyond which all if's are matched to an else. Noting such structure can help disambiguate grammars. ---------------------------------------------------- Eliminating Left recursion of CFG: A -> Aa | b becomes: A -> bA' A' -> aA' | e where e is the empty string (we use this convention throughout the remainder of the notes). In general, group all productions that begin with the variable and those that don't. ie: A -> Aa_1 | ... | Aa_n | b_1 | ... | b_m where a_i,b_i are symbol strings, and the b_i don't begin with A. Then we get A -> b_1 A' | ... | b_m A' A' -> a_1 A' | ... | a_n A' pseudo-code: Remove e-transitions. Remove cycles (rules A => ... => A). Order the rules, 1 to n. for i = 1 to n for j = 1 to i - 1 replace rules A_i -> A_j b by A_i -> a_1 b | ... | a_m b where b is a symbol tring and the a_k are A_j's productions (Essentially, replace variables that appeared earlier in the ordering if they start the present rule) Eliminate immediate left recursion Completing the inner loop changes all left recursions between A_i and A_j into immediate left recursions, where j <= i. ---------------------------------------------------- Left Factoring: Suppose we have A -> aB | aC If we are reading a token stream left-to-right, we don't know which rule to pick. We can defer the choice until later by factoring over the common prefix. We get: A -> aA' A' -> B | C The general method is an obvious extension of this. ----------------------------------------------------- Generating LL(1) grammars: FIRST(X) do If X is a terminal, return {X} If X -> Y_1 Y_2 ... Y_n is a production and Y_1 ... Y_m are e add FIRST(Y_{m + 1}) to X. If X -> e is a production, add e. until nothing is added to any FIRST sets. FOLLOW(X) Add $ to FOLLOW(S) If there is a rule A -> aBC add FIRST(C) to FOLLOW(B) if there is a rule A -> aB add FOLLOW(A) to FOLLOW(B) if there is a rule A -> aBC and e is in FIRST(C) add FOLLOW(A) to FOLLOW(B) Use the follow sets to fill in the parse table. To process input: Initialize stack with start symbol and $ Make substitutions to top of stack based on parse table and next symbol of intput. "Cancel" out terminals on the top of both stacks. Repeat until either (1) both stacks are empty, (2) a parse table is empty, or (3) one of the stacks isn't empty. The latter two conditions are failures. Suppose A -> x | y (where x,y are symbol strings). A grammar is LL(1) iff 1) FIRST(x) and FIRST(y) are disjoint 2) if y derives e, then x doesn't derive a string beginning eith a terminal in FOLLOW(A). Ie: if e in FIRST(y) then FIRST(x) and FOLLOW(A) are disjoint. The conditions can be compressed into: FIRST(x FOLLOW(A)) and FIRST(y FOLLOW(A)) are disjoint. A grammar with left-recursion or which admits left-factoring is not LL(1), but can be converted into one which might be LL(1) -- or not. ----------------------------------------------------- Generating LR(0)/SLR(0) grammars: Let '.' denote our position within the rule. If A -> aBc is a rule, then we have associated items to A: A -> . aBc A -> a . Bc A -> aB . c A -> aBc . We want to make an automaton that uses collections of items (item sets) to determine where we might be in a derivation. We build item sets using two procedures: CLOSURE and GOTO. Suppose I is an item set, and X some symbol (terminal or variable). CLOSURE(I) add I's elements to CLOSURE(I) Until no new elements are added: for each rule A -> a . Bc in I, where A,B are non-terminals add B-> . B_1 | ... | . B_n to CLOSURE(I), where the B_i are B's productions. GOTO(I,X) If A -> a . XBc is in I, then add A -> aX . Bc to GOTO(I,X) To build the automaton, find the closure of the start symbol, and compute all goto sets and their closures until no more are added. We need to build the canonical item sets. In pseudocode: G' is the grammar G with an augmented start rule S' -> S items(G') C = { CLOSURE({ S' -> S }) } repeat for each set of items I in C for each grammar symbol X if GOTO(I,X) is not empty add GOTO(I,X) to C until no new sets of items are added to C in a round return C In processing, we build up a stack of our working area, and reduce according the the stack contents. In SLR(0), no shift/reduce or reduce/reduce conflicts are allowed: shift/reduce: if B -> a . and A -> a . X b are in the same item set, then if B's FOLLOW set contains X, there is a conflict. If A -> a . and B -> b . are in the same item set and their FOLLOW sets aren't disjoint, there is a conflict. Example: If we have S -> A_1 | ... | A_n A_1 -> B_1 | ... | B_m the first state of our automaton is S -> . A_1 | ... | . A_n A_1 -> . B_1 | ... | . B_m ... SLR(k) grammars can resolve conflicts by looking ahead $k$ tokens. LR(k) grammars LR(0) grammars must decide whether to shift or reduce solely based on the working-area stack contents. ---------------------------------------------------- Static Scoping: The scope of the a variable is the innermost scope where that identifier was defined. Functions declared outside a class have the scope of their definition, regardless of where they are invoked. Functions declared inside a class call first members inside their present class, then their super class, then super super class, ..., then the scope of the method's invocation, etc. With multiple inheritance (eg: C++), either (1) the parent classes don't share an identifier, so the reference is unambiguous, or (2) the parent classes share an identifier, so the reference needs to be annotated (ie: Base1::foo). Implemented using spaghetti stacks -- a singly linked list where the head of the list is the top of the stack. Dynamic Scoping: Keep a stack of environments at runtime and resolve identifiers by traversing the environments until a match is found, or report an error (think Scheme).